Research: Cloud Demo Infrastructure
Feature: 007-cloud-demo-infra Date: 2026-02-15
Technology Decisions
1. Hetzner Cloud Provider
Decision: Use Hetzner Cloud with official Terraform provider
Rationale: - 5x cheaper than DigitalOcean/Vultr for equivalent specs (~β¬0.007/hr vs $0.036/hr per VM) - US West (Hillsboro) datacenter available for west coast latency - Rocky Linux 9 available as base image (matches spec 006 Vagrant demo) - Terraform provider is mature and well-documented (hetznercloud/hcloud) - API supports all required features: VMs, private networks, SSH keys, labels
Alternatives Considered: - DigitalOcean: Excellent Terraform support but 5x more expensive - Vultr: Similar pricing to DO, less mature Terraform provider - AWS/GCP: Overkill for demo use case, complex IAM setup
2. Terraform for Infrastructure
Decision: Use Terraform with local state (optional Terraform Cloud)
Rationale: - Industry standard IaC tool, aligns with constitution principle VIII - Hetzner provider is first-party maintained - Local state is sufficient for single-user demo workflow - State file enables cluster detection (FR-006a: block duplicate spin-up) - Terraform Cloud option for team usage without additional complexity
Alternatives Considered: - Pulumi: More complex, requires additional runtime - Ansible cloud modules: Less idiomatic for provisioning, better for config - Hetzner CLI (hcloud): Not declarative, harder to ensure idempotency
3. Dynamic Inventory Generation
Decision: Terraform template file generates Ansible inventory YAML
Rationale:
- Terraform local_file resource with templatefile() function
- Inventory format matches spec 006 structure (groups: mgmt, login, compute)
- IPs injected from Terraform outputs, no manual editing
- Inventory file path: infra/terraform/inventory.yml (gitignored)
Alternatives Considered: - Ansible dynamic inventory script: Additional complexity, another tool - Manual inventory: Error-prone, defeats automation purpose - Terraform output JSON + jq: Extra step in workflow
4. SSH Key Auto-Detection
Decision: Check for ~/.ssh/id_ed25519.pub first, fall back to ~/.ssh/id_rsa.pub
Rationale:
- Ed25519 is modern, recommended default
- RSA fallback for legacy key users
- Error with clear message if neither exists
- Environment variable override: DEMO_SSH_KEY for non-standard paths
Alternatives Considered: - Require explicit path: More friction for common case - Generate ephemeral key: Security concern, key management burden - Use Hetzner-managed SSH keys: Requires pre-registration in console
5. TTL Warning Implementation
Decision: Check creation timestamp on each make target, warn if exceeded
Rationale:
- Terraform resource labels store creation timestamp
- Each wrapper script checks hcloud server list --selector for TTL
- Warning printed to stderr, does not block operation
- No external services required (no Slack/email webhooks)
Alternatives Considered: - Background daemon: Complex, may not be running when user returns - Email/webhook notification: External dependency, setup friction - Auto-teardown: Too aggressive, could interrupt active demo
6. Cluster Detection (Single Cluster Enforcement)
Decision: Check Terraform state for existing resources before spin-up
Rationale:
- terraform state list returns empty if no cluster exists
- Non-empty state blocks spin-up with clear message
- User must run teardown to proceed
- Prevents orphaned resources and billing surprises
Alternatives Considered: - Allow multiple clusters with unique names: Increases cost risk, complexity - Auto-teardown before spin-up: Could destroy active demo accidentally - Hetzner API check only: State file is more reliable for our workflow
7. VM Sizing
Decision: Use Hetzner CPX series (shared vCPU, cost-optimized)
Rationale: - mgmt01: CPX21 (3 vCPU, 4GB) - FreeIPA, Wazuh, Slurm controller, NFS - login01: CPX11 (2 vCPU, 2GB) - SSH gateway, Slurm submit - compute01/02: CPX11 (2 vCPU, 2GB) - Slurm workers
Cost breakdown (US West, per hour): - CPX21: β¬0.0119/hr - CPX11: β¬0.0059/hr Γ 3 = β¬0.0177/hr - Network: β¬0.00 (private network free) - Total: ~β¬0.03/hr (~$0.03/hr)
Alternatives Considered: - Dedicated vCPU (CCX): 3x more expensive, not needed for demos - Smaller instances: FreeIPA needs 4GB minimum for stability - Larger instances: Unnecessary for demo workloads
8. Network Architecture
Decision: Private network 10.0.0.0/24, public IPs on mgmt01 and login01 only
Rationale: - Private network for inter-node communication (NFS, Slurm, FreeIPA) - Public IP on mgmt01 for admin access - Public IP on login01 for workshop attendee SSH access - Compute nodes internal-only (realistic HPC topology) - Matches spec 006 Vagrant network design
Alternatives Considered: - All public IPs: Unnecessary exposure, additional cost - Single public IP + bastion: Extra hop, latency for demos - IPv6 only: Not universally accessible from all networks
9. OS Image
Decision: Rocky Linux 9 (Hetzner standard image)
Rationale: - Matches spec 006 Vagrant demo for consistency - RHEL-compatible, required for FreeIPA, Wazuh, compliance roles - Hetzner maintains official Rocky 9 image - Aligns with constitution target OS (RHEL 9 / Rocky Linux 9)
Alternatives Considered: - AlmaLinux 9: Equivalent, but Rocky is spec 006 standard - Fedora: Not enterprise-stable, shorter support lifecycle - Ubuntu: Different package ecosystem, would require playbook changes
10. Wrapper Script Design
Decision: Bash scripts calling Terraform and Ansible with progress output
Rationale:
- demo-cloud-up.sh: terraform apply β generate inventory β ansible-playbook provision.yml
- demo-cloud-down.sh: confirm prompt β terraform destroy
- Colored output with status messages (matching spec 006 demo scripts)
- Exit codes: 0=success, 1=terraform fail, 2=ansible fail, 3=validation fail
Alternatives Considered: - Makefile only: Less flexibility for progress output and validation - Python wrapper: Overkill for simple orchestration - Direct terraform/ansible: User must remember multiple commands